AITopics | exploratory policy

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Neural Information Processing SystemsMar-21-2026, 14:23:57 GMT

In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emph{exploratory} policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches---least-squares regression oracles and naive randomized exploration---yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails. We validate our theoretical results on several realistic robotic simulators and a real-world robotic sim2real task, demonstrating that transferring exploratory policies can yield substantial gains in practice as well.

artificial intelligence, machine learning, reinforcement learning, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.50)

Add feedback

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL Andrew Wagenmaker

Neural Information Processing SystemsFeb-16-2026, 14:41:25 GMT

Such direct sim2real transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of exploratory policies which enable efficient exploration in the real world.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Alameda County > Livermore (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Information Technology (0.67)
Leisure & Entertainment > Games > Computer Games (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.45)

Add feedback

c1b8bf9e071c0dabb899e7a27f353762-Paper.pdf

Neural Information Processing SystemsFeb-11-2026, 00:27:01 GMT

algorithm, assumption, international conference, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.14)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)
Information Technology > Data Science > Data Mining > Big Data (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)

Add feedback

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Neural Information Processing SystemsDec-24-2025, 20:52:22 GMT

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state-of-the-art.

adversarial mdp, improved exploration, policy optimization, (11 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.59)

Add feedback

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL Andrew Wagenmaker

Neural Information Processing SystemsOct-10-2025, 09:29:12 GMT

Such direct sim2real transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of exploratory policies which enable efficient exploration in the real world.

exploration, probability, sim, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Alameda County > Livermore (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry:

Information Technology (0.67)
Leisure & Entertainment > Games > Computer Games (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.45)

Add feedback

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses Haipeng Luo Chen-Y u Wei

Neural Information Processing SystemsAug-17-2025, 05:17:57 GMT

Policy optimization is a widely-used method in reinforcement learning.

data mining, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.14)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.47)
Information Technology > Data Science > Data Mining > Big Data (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)

Add feedback

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Neural Information Processing SystemsMay-27-2025, 08:39:54 GMT

In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emph{exploratory} policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches---least-squares regression oracles and naive randomized exploration---yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails.

leveraging simulation, sim-to-real gap, simulator, (9 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games > Computer Games (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.52)

Add feedback

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Neural Information Processing SystemsJan-19-2025, 00:33:37 GMT

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state-of-the-art. When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain \widetilde{\mathcal{O}}({T} {\frac{2}{3}}) regret with the help of a simulator, matching the result of Neu and Olkhovskaya [2020] while importantly removing the need of an exploratory policy that their algorithm requires.

assumption, improved exploration, policy optimization, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)

Add feedback

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Wagenmaker, Andrew, Huang, Kevin, Ke, Liyiming, Boots, Byron, Jamieson, Kevin, Gupta, Abhishek

arXiv.org Machine LearningOct-26-2024

In order to mitigate the sample complexity of real-world reinforcement learning, common practice is to first train a policy in a simulator where samples are cheap, and then deploy this policy in the real world, with the hope that it generalizes effectively. Such \emph{direct sim2real} transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of \emph{exploratory} policies which enable efficient exploration in the real world. In particular, in the setting of low-rank MDPs, we show that coupling these exploratory policies with simple, practical approaches -- least-squares regression oracles and naive randomized exploration -- yields a polynomial sample complexity in the real world, an exponential improvement over direct sim2real transfer, or learning without access to a simulator. To the best of our knowledge, this is the first evidence that simulation transfer yields a provable gain in reinforcement learning in settings where direct sim2real transfer fails. We validate our theoretical results on several realistic robotic simulators and a real-world robotic sim2real task, demonstrating that transferring exploratory policies can yield substantial gains in practice as well.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

2410.20254

Country:

North America > United States > California > Alameda County > Livermore (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (0.40)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

The Impact of Data Distribution on Q-learning with Function Approximation

Santos, Pedro P., Carvalho, Diogo S., Sardinha, Alberto, Melo, Francisco S.

arXiv.org Artificial IntelligenceFeb-10-2023

We study the interplay between the data distribution and Q-learning-based algorithms with function approximation. We provide a unified theoretical and empirical analysis as to how different properties of the data distribution influence the performance of Q-learning-based algorithms. We connect different lines of research, as well as validate and extend previous results. We start by reviewing theoretical bounds on the performance of approximate dynamic programming algorithms. We then introduce a novel four-state MDP specifically tailored to highlight the impact of the data distribution in the performance of Q-learning-based algorithms with function approximation, both online and offline. Finally, we experimentally assess the impact of the data distribution properties on the performance of two offline Q-learning-based algorithms under different environments. According to our results: (i) high entropy data distributions are well-suited for learning in an offline manner; and (ii) a certain degree of data diversity (data coverage) and data quality (closeness to optimal policy) are jointly desirable for offline learning.

data distribution, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2111.11758

Country: Europe > Portugal > Lisbon > Lisbon (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.84)

Add feedback

Filters

Collaborating Authors

exploratory policy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL Andrew Wagenmaker

c1b8bf9e071c0dabb899e7a27f353762-Paper.pdf

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL Andrew Wagenmaker

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses Haipeng Luo Chen-Y u Wei

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL

The Impact of Data Distribution on Q-learning with Function Approximation